ORLA 6541 - Exercise 1
The following code and write-up contain answers to the in-text exercises in Wickham & Grolemund (2017), Chapters 3-6, and Bulut & Desjardins (2019), Chapter 4.
A blank grey square
library(tidyverse)
ggplot(data=mpg)
234 rows and 11 columns
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
The drv variable describes the type of drive train for each car in the data set, where f = front-wheel drive, r = rear-wheel drive, and 4 = four-wheel drive.
?mpg
ggplot(data = mpg)+geom_point(mapping=aes(x=cyl, y=hwy))
Since both variables are categorical, the points pile up at each intersection of the two categories, which does not provide much useful information. Scatterplots are more useful for visualizing the relationship between two continuous variables.
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
## 3.3.1 Exercises

#### 1. What's gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
aes() is used to map an aesthetic to a variable. In this case, we wanted to associate the color aesthetic with the literal color "blue". Since "blue" is not a variable, it cannot be mapped inside aes(); instead, the aesthetic must be set manually outside of aes(). As such:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
?mpg
Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cyl, cty, hwy
This information appears under each variable's name when running ?mpg.
Assigning color to a continuous variable produces a continuous color gradient, while assigning color to a categorical variable gives each level of the variable its own color.
Similarly, assigning size to a continuous variable produces a graduated scale of sizes, while assigning size to a categorical variable (although not advised) gives each level a different size.
Continuous variables cannot be mapped to shape. Assigning shape to a categorical variable displays each level of the variable with a different shape, with a maximum of 6 discrete shapes, because more than 6 become difficult to discriminate.
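As an illustration (a sketch assuming ggplot2's mpg data), mapping shape to class, which has seven levels, triggers the six-shape warning and leaves the seventh level (suv, alphabetically last) unplotted:

```r
library(ggplot2)

# class has 7 levels; the default shape palette supplies only 6,
# so ggplot2 warns and drops points for the unmatched level
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
```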
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
You get a scatterplot in which the two aesthetics overlap; both convey the same information.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty, alpha=cty))
The stroke aesthetic modifies the width of the border for shapes that have a border (e.g., shapes 21-25).
?geom_point
It colors the points by whether displ is less than 5: the expression displ < 5 is evaluated to TRUE or FALSE, and that logical value is mapped to color.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ<5))
We get a separate facet (panel) for each unique value of the variable, which for a continuous variable can produce a very large number of panels.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ cty, nrow = 2)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
The empty cells in facet_grid suggest that the two variables do not intersect at those combinations of values. This relates to the first plot: it contains much empty space, and if we were to facet it, we would likely get empty cells as well.
The . placeholder tells facet_grid() not to facet in that dimension: facet_grid(drv ~ .) facets by drv across rows only, while facet_grid(. ~ cyl) facets by cyl across columns only.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Advantages: it is easier to identify patterns within each value of a
variable; we can have a larger number of facets while still being able
to identify which facet represents which value, while with a large
number of colors, it may be difficult to discriminate between values.
Disadvantages: It may be more challenging to compare patterns between
values of the variable and to identify overall patterns.
?facet_wrap
nrow determines the number of rows of panels. Other
options that control the layout of the individual panels are
ncol, scales, and dir. ncol determines the
number of columns of panels. facet_grid() does not have
nrow and ncol arguments because the numbers of rows and columns
are determined by the number of unique levels of the faceting variables.
Because if you put the variable with more unique levels in the rows, it would create a plot that is taller than it is wide, which may be hard to follow and interpret on most screens.
line chart - geom_line()
boxplot - geom_boxplot()
histogram - geom_histogram()
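As a sketch (assuming the mpg data), each chart type comes from swapping the geom while the aesthetic mapping stays similar:

```r
library(ggplot2)

# boxplot of highway mileage by drive train
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = drv, y = hwy))

# histogram of highway mileage
ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = hwy), binwidth = 2)
```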
This will create a scatterplot with displ as the x variable and hwy as the y variable. drv will be represented by coloring both the points and the smooth lines by the levels of the variable, and the standard-error bands will not appear because se = FALSE.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
show.legend = FALSE hides the legend box of the plot. It is plausible that it was removed so that the plot's size is consistent with the previous plots.
It visualizes the confidence interval around the smooth line as a shaded area; TRUE to show, FALSE to hide.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
No, they will be identical, because both geom_point() and geom_smooth() use the same data and mapping inherited from ggplot(), which is the same for both plots.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size=5) +
geom_smooth(se = FALSE, size=2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_smooth(mapping = aes(group = drv), se = FALSE, size=2) +
geom_point(size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv))+
geom_smooth(se = FALSE, size=2) +
geom_point(size=5)
ggplot(data = mpg, mapping = aes (x = displ, y = hwy))+
geom_smooth(se = FALSE, size=2) +
geom_point(mapping=aes(color=drv), size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_smooth(mapping=aes(linetype=drv), se = FALSE, size=2) +
geom_point(mapping = aes(color=drv), size=5)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
geom_point(size = 5, color = "white") +
geom_point(aes(color = drv))
?stat_summary
The default geom associated with stat_summary() is geom_pointrange().
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
?geom_col
?geom_bar
There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
Geoms and stats come in matched pairs that share names: each geom has a default stat, and each stat has a default geom.
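For example, geom_bar() defaults to stat_count(), and stat_count() defaults to geom_bar(), so the two calls below produce the same chart (a sketch assuming ggplot2's diamonds data):

```r
library(ggplot2)

# these are equivalent: each supplies the other's default
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))
```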
?stat_smooth
Two parameters that control stat_smooth()'s behavior are
method and formula. method specifies the kind
of smoothing function to apply, such as a linear model (lm) or a
generalized linear model (glm), and formula lets the user specify
the model formula for the chosen method. In addition,
users can adjust the fullrange parameter, which tells
R whether to fit the smooth to just the data range or across the full plot.
stat_smooth() generates the following variables:
y or x - predicted value of the given variable(s)
ymin or xmin - lower pointwise confidence interval around the mean
ymax or xmax - upper pointwise confidence interval around the mean
se - standard error
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
The problem is that all of the bars are the same height, because the proportions are computed within each group and therefore all equal 1. Setting group = 1 ensures that the proportions are calculated relative to the whole data set rather than within each group.
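The corrected version of the first chart sets group = 1 so that proportions are computed over all observations:

```r
library(ggplot2)

# group = 1 makes stat_count() compute prop over the whole data set
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
```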
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
It is possible that many points in the plot overlap and, therefore, hide
each other. We can improve the plot by using the geom_jitter() function
to add a small amount of random noise to each point, which will reveal
any overlapping points. As such:
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
Width, controlling the horizontal random noise, and height, controlling the vertical random noise.
?geom_jitter
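For example, width and height can be set directly (in data units; by default each is 40% of the resolution of the data):

```r
library(ggplot2)

# more horizontal than vertical noise
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.8, height = 0.2)
```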
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
While geom_jitter() adds random noise to each point to reduce
overplotting, geom_count() changes the size of each point according
to the number of observations at its specific value. Although
geom_count() can help reduce overplotting, it may create
overplotting of its own due to the size of the points.
?geom_boxplot
The default position adjustment for geom_boxplot() is dodge2.
ggplot(data = mpg, aes(x = drv, y = hwy)) +
geom_boxplot()
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar()
ggplot(mpg, aes(x = class, fill = drv)) + geom_bar() + coord_polar()
?labs()
labs() allows users to set plot labels. Within the
labs() function, users can provide details for host of plot
labels, like title, subtitle, caption, and more.
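A quick sketch of labs() applied to the mpg plot used throughout (the label text here is illustrative):

```r
library(ggplot2)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  labs(
    title = "Fuel efficiency decreases with engine size",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    caption = "Data from fueleconomy.gov"
  )
```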
?coord_quickmap
coord_map() from ggplot2 allows users to map their data
onto a 2D projection of the earth. Due to the curvature of the earth,
the documentation notes that map projections do not preserve straight
lines. As such, coord_quickmap() can be used alternatively,
which performs calculations to provide an approximation of the requested
area of the globe but with straight lines.
The plot below suggests a strong positive relationship between city miles per gallon and highway miles per gallon; as one rises, so does the other.
coord_fixed() is important because it ensures a fixed ratio between the physical representation of data points on the X and Y axes. This makes it easier to identify patterns in the plot.
geom_abline() adds a reference line to a plot, either horizontal, vertical, or diagonal, which is useful for annotating plots. In this case, with no arguments, it added the diagonal reference line y = x, which made it easier to compare city and highway mpg.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
coord_fixed()
my_variable <- 10
my_varıable
#> Error in eval(expr, envir, enclos): object ‘my_varıable’ not found
This code does not work because the second line uses a dotless "ı" in my_varıable, which does not match the "i" in my_variable defined above.
library(tidyverse)
ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
filter(diamond, carat > 3)
The errors: dota should be data, fliter should be filter, comparison requires == (not =), and the data set is diamonds, not diamond. Corrected:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
filter(mpg, cyl == 8)
## # A tibble: 70 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a6 quattro 4.2 2008 8 auto… 4 16 23 p mids…
## 2 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 3 chevrolet c1500 sub… 5.3 2008 8 auto… r 11 15 e suv
## 4 chevrolet c1500 sub… 5.3 2008 8 auto… r 14 20 r suv
## 5 chevrolet c1500 sub… 5.7 1999 8 auto… r 13 17 r suv
## 6 chevrolet c1500 sub… 6 2008 8 auto… r 12 17 r suv
## 7 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 8 chevrolet corvette 5.7 1999 8 auto… r 15 23 p 2sea…
## 9 chevrolet corvette 6.2 2008 8 manu… r 16 26 p 2sea…
## 10 chevrolet corvette 6.2 2008 8 auto… r 15 25 p 2sea…
## # … with 60 more rows
filter(diamonds, carat > 3)
## # A tibble: 32 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
## 2 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
## 3 3.01 Premium F I1 62.2 56 9925 9.24 9.13 5.73
## 4 3.05 Premium E I1 60.9 58 10453 9.26 9.25 5.66
## 5 3.02 Fair I I1 65.2 56 10577 9.11 9.02 5.91
## 6 3.01 Fair H I1 56.1 62 10761 9.54 9.38 5.31
## 7 3.65 Fair H I1 67.1 53 11668 9.53 9.48 6.38
## 8 3.24 Premium H I1 62.1 58 12300 9.44 9.4 5.85
## 9 3.22 Ideal I I1 62.6 55 12545 9.49 9.42 5.92
## 10 3.5 Ideal H I1 62.8 57 12587 9.65 9.59 6.03
## # … with 22 more rows
This chapter demonstrates the application of the
data.table in R as an alternative to tidyverse and base R
methods of data wrangling. The authors suggest that
data.table is particularly useful for working with large
volumes of data (e.g., 10-100 GB).
library(data.table)
Due to slow loading speeds when importing the full pisa data set,
even with fread() from data.table, we decided to use the
smaller ‘random6’ data set that the authors also made available.
According to the authors, the random data set contains data collected
from a random sample of a single country with records from all 6 of its
regions.
# read in the random data set
random6 <- fread("random6.csv", na.strings = "")
# check object size - 0.2 gigabytes
print(object.size(random6), unit = "GB")
## 0.2 Gb
# read in pisa data set
#pisa <- fread("pisa2015.csv")
Using data.table syntax to subset the data set, we found that there
are 3,197 female students from Germany in the data set. In addition, an
alternative method using the .N function to return the
number of rows is shown below as well.
# explore data; first 5 and last 5 rows returned when printing a data.table object
#random6 # don't run for now due to loading speed
# subset all female students from the field ST004D01T
#random6[CNT == "Germany" & ST004D01T == "Female"]
# return row count with .N function and chaining
#random6[CNT == "Germany" & ST004D01T == "Female", .N] # don't print code as the output is very large in the markdown report
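Since the subsetting code above is not run here, the same DT[i, j] pattern can be illustrated on a small toy data.table (assuming the data.table package is installed; column names mimic the PISA variables):

```r
library(data.table)

# toy data mimicking the CNT and ST004D01T columns
dt <- data.table(
  CNT = c("Germany", "Germany", "France", "Germany"),
  ST004D01T = c("Female", "Male", "Female", "Female")
)

# i selects rows; .N in j returns the row count of the subset
dt[CNT == "Germany" & ST004D01T == "Female", .N]
#> [1] 2
```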
# first create custom function to convert bin to numeric
bin.to.num <- function(x){
if (is.na(x)) NA
else if (x == "Yes") 1L
else if (x == "No") 0L
}
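Because bin.to.num() works on one element at a time, it must be applied with sapply() below. A vectorized alternative (a sketch using base ifelse(), which propagates NA automatically since NA comparisons yield NA) would avoid the sapply() calls:

```r
# vectorized version: operates on the whole column at once
bin.to.num.vec <- function(x) {
  ifelse(x == "Yes", 1L, ifelse(x == "No", 0L, NA_integer_))
}

bin.to.num.vec(c("Yes", "No", NA))
#> [1]  1  0 NA
```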
# using the custom function, create new variables (such as computer and software) that we can use to calculate the proportion of students with access to these resources
random6[, `:=`
(female = ifelse(ST004D01T == "Female", 1, 0),
sex = ST004D01T,
# At my house we have ...
desk = sapply(ST011Q01TA, bin.to.num),
own.room = sapply(ST011Q02TA, bin.to.num),
quiet.study = sapply(ST011Q03TA, bin.to.num),
computer = sapply(ST011Q04TA, bin.to.num),
software = sapply(ST011Q05TA, bin.to.num),
internet = sapply(ST011Q06TA, bin.to.num),
lit = sapply(ST011Q07TA, bin.to.num),
poetry = sapply(ST011Q08TA, bin.to.num),
art = sapply(ST011Q09TA, bin.to.num),
book.sch = sapply(ST011Q10TA, bin.to.num),
tech.book = sapply(ST011Q11TA, bin.to.num),
dict = sapply(ST011Q12TA, bin.to.num),
art.book = sapply(ST011Q16NA, bin.to.num))]
In Germany, 95% of students have a computer in their home and 45% have educational software.
random6[CNTRYID == "Germany", table(computer)] # 247 No's and 5438 Yes'
## computer
## 0 1
## 247 5438
# computer prop
5438 /(5438 + 247)*100 # 95% of students in Germany have a computer in their home
## [1] 95.65523
random6[CNTRYID == "Germany", table(software)] # 3038 No's and 2481 Yes'
## software
## 0 1
## 3038 2481
# educational software prop
2481/(2481 + 3038) # 45% have access to educational software
## [1] 0.449538
In Uruguay, 88% of students have a computer in the home and 43% of students have educational software.
random6[CNTRYID == "Uruguay", table(computer)] # 635 No's and 5099 Yes'
## computer
## 0 1
## 635 5099
# computer in home prop
5099/(5099 + 635) # 88% have a computer in the home
## [1] 0.8892571
random6[CNTRYID == "Uruguay", table(software)] # 3013 No's and 2313 Yes'
## software
## 0 1
## 3013 2313
# educational software prop
2313/(2313 + 3013) # 43% have access to educational software
## [1] 0.4342846
Among female students in the random6 data set, 72% stated that they have their own room and 85% stated that they have a quiet place to study.
random6[female == 1, table(own.room)] # 4805 No's and 12644 Yes'
## own.room
## 0 1
## 4805 12644
# prop own.room
12644 / (12644 + 4805) #72%
## [1] 0.7246261
random6[female == 1, table(quiet.study)] #2606 No's and 14801 Yes'
## quiet.study
## 0 1
## 2606 14801
# prop quiet.study
14801 / (14801 + 2606) # 85%
## [1] 0.8502901
About 51% of female students and 50% of male students have art in their home. The average age of both male and female students is 15.8 years.
# proportion of students who have art in their homes:
random6[, .(art = mean(art, na.rm = TRUE), AGE = mean(AGE, na.rm = TRUE)), by = .(sex)]
## sex art AGE
## 1: Female 0.5061029 15.81118
## 2: Male 0.4981741 15.80671
# create custom function to split a column of data by above or below median
median_cut = function(x) {
ifelse(x > median(x), "above_median", "below_median")
}
random6[,
.(own.room = mean(own.room, na.rm = TRUE),
desk_mean = mean(desk, na.rm = TRUE)),
by = .(median_cut(AGE)) ]
## median_cut own.room desk_mean
## 1: above_median 0.7439972 0.8684374
## 2: below_median 0.7441766 0.8680203
The authors show that the code below, using the melt()
function, can be used to reshape the data sets from wide to long
formats.
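A toy sketch of the wide-to-long reshape with melt() (assuming the data.table package; the column names are illustrative):

```r
library(data.table)

wide <- data.table(id = c(1, 2), math = c(500, 480), read = c(510, 470))

# id.vars stay fixed; the measure.vars columns are stacked into
# variable/value pairs, one block per measure column
long <- melt(wide, id.vars = "id", measure.vars = c("math", "read"))
long
#>    id variable value
#> 1:  1     math   500
#> 2:  2     math   480
#> 3:  1     read   510
#> 4:  2     read   470
```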
# recode and store as 'sci_car'; use table to confirm re-coding
random6[,
"sci_car" := sapply(PA032Q03TA,
function(x) {
if (is.na(x)) NA
else if (x == "No") -1L
else if (x == "Yes") 1L
}) ][,
table(sci_car)]
## sci_car
## -1 1
## 5104 4946
Because we are using the more limited random6 data set, we restrict the query to Germany and Mexico, since the other countries are not in the data set.
random6[CNTRYID %in% c("Germany", "Mexico"),
.(mean(sci_car, na.rm = TRUE)),
by = .(sex, CNTRYID)]
## sex CNTRYID V1
## 1: Female Germany -0.8348018
## 2: Male Germany -0.7085292
## 3: Male Mexico 0.3975610
## 4: Female Mexico 0.3200577
# means and sd of variables hypothesized to predict sci_car
random6[,
.(environ_avg = mean(ENVAWARE, na.rm = TRUE),
environ_sd = sd(ENVAWARE, na.rm = TRUE),
joy_avg = mean(JOYSCIE, na.rm = TRUE),
joy_sd = sd(JOYSCIE, na.rm = TRUE),
self_eff_avg = mean(SCIEEFF, na.rm = TRUE),
self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
)]
## environ_avg environ_sd joy_avg joy_sd self_eff_avg self_eff_sd
## 1: -0.1688647 1.097327 0.06985556 1.117806 -0.008148365 1.204854
# means and sd of focal variables, grouped by sci_car and sex
random6[,
.(environ_avg = mean(ENVAWARE, na.rm = TRUE),
environ_sd = sd(ENVAWARE, na.rm = TRUE),
joy_avg = mean(JOYSCIE, na.rm = TRUE),
joy_sd = sd(JOYSCIE, na.rm = TRUE),
self_eff_avg = mean(SCIEEFF, na.rm = TRUE),
self_eff_sd = sd(SCIEEFF, na.rm = TRUE)
), by = .(sci_car, sex)]
## sci_car sex environ_avg environ_sd joy_avg joy_sd self_eff_avg
## 1: NA Female -0.31842610 0.9874307 -0.09216258 1.1173348 -0.171922907
## 2: -1 Female -0.01864386 0.9728241 -0.11241710 1.0385336 0.016697673
## 3: 1 Male 0.18221720 1.1622236 0.55847968 1.0094256 0.361605273
## 4: -1 Male 0.06017463 1.1645145 0.12967621 1.0792426 0.162395292
## 5: NA Male -0.20814244 1.1916179 0.06879025 1.1356331 -0.009655823
## 6: 1 Female 0.08936607 0.9659264 0.55664182 0.9397745 0.311395444
## self_eff_sd
## 1: 1.186161
## 2: 1.088183
## 3: 1.107026
## 4: 1.141781
## 5: 1.267313
## 6: 1.059443
# first store smaller data set containing observations for all focal variables and sci_car
sci_car_variables <- random6[,
.(sci_car, ENVAWARE, JOYSCIE, SCIEEFF)]
sci_car_variables[,
.(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"),
env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"),
sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"))]
## env_joy_cor joy_self_eff_cor env_sci_car_cor sci_car_joy_cor
## 1: 0.3650092 0.3482787 0.05626268 0.2666819
# recode and store as 'math_split'; use table to confirm re-coding
random6[,
"math_split" := sapply(PV1MATH,
function(x) {
if (is.na(x)) NA
else if (x <= 490) -1L
else if (x > 490) 1L
})][,
table(math_split)]
## math_split
## -1 1
## 21578 14269
random6[,
"reading_split" := sapply(PV1READ,
function(x) {
if (is.na(x)) NA
else if (x <= 493) -1L
else if (x > 493) 1L}) ][,
table(reading_split)]
## reading_split
## -1 1
## 21191 14656
# calculate correlations with these new variables and those above
# store as table for subsequent formatting
math_reading_cor <- random6[,
.(env_joy_cor = cor(x = ENVAWARE, y = JOYSCIE, use = "complete.obs"),
joy_self_eff_cor = cor(x = JOYSCIE, y = SCIEEFF, use = "complete.obs"),
env_sci_car_cor = cor(x = ENVAWARE, y = sci_car, use = "complete.obs"),
sci_car_joy_cor = cor(x = JOYSCIE, y = sci_car, use = "complete.obs"),
math_joy_cor = cor(x = math_split, y = sci_car, use = "complete.obs"),
reading_joy_cor = cor(x = reading_split, y = sci_car, use = "complete.obs"),
math_env_cor = cor(x = ENVAWARE, y = math_split, use = "complete.obs"),
reading_env_cor = cor(x = JOYSCIE, y = reading_split, use = "complete.obs"),
math_self_eff_cor = cor(x = SCIEEFF, y = math_split, use = "complete.obs"),
reading_self_eff_cor = cor(x = SCIEEFF, y = reading_split, use = "complete.obs"))]
# pivot long using melt(); easier to view correlations this way
melt(math_reading_cor)
## variable value
## 1: env_joy_cor 0.36500924
## 2: joy_self_eff_cor 0.34827868
## 3: env_sci_car_cor 0.05626268
## 4: sci_car_joy_cor 0.26668187
## 5: math_joy_cor -0.21986258
## 6: reading_joy_cor -0.20404487
## 7: math_env_cor 0.12821305
## 8: reading_env_cor 0.05982742
## 9: math_self_eff_cor 0.02318556
## 10: reading_self_eff_cor 0.04272108
# row means of JOYSCIE and INTBRSCI by country; note that the trailing
# chain [,, by = sci_car] supplies by= without j=, so data.table ignores
# it, as the warnings below indicate
random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, .SDcols = c("JOYSCIE", "INTBRSCI")][,,by = sci_car]
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## Ignoring by= because j= is not supplied
## Warning in `[.data.table`(random6[, .(Mean = rowMeans(.SD)), by = CNTRYID, :
## i and j are both missing so ignoring the other arguments. This warning will be
## upgraded to error in future.
## CNTRYID Mean
## 1: Germany -0.68550
## 2: Germany -1.62215
## 3: Germany -0.97025
## 4: Germany 1.03440
## 5: Germany 0.06700
## ---
## 35843: Uruguay 1.16175
## 35844: Uruguay 0.22755
## 35845: Uruguay -1.29245
## 35846: Uruguay -0.31095
## 35847: Uruguay -0.54705
# create column subset to convert to numeric
cols_to_convert <- c("MISCED", "FISCED")
# convert to numeric with subset index and using lapply()
random6[ ,
(cols_to_convert) := lapply(.SD, as.numeric),
.SDcols = c("MISCED", "FISCED")]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# means come back NA because these variables contain missing values
random6[, lapply(.SD, mean), .SDcols = c("DISCLISCI", "TEACHSUP", "IBTEACH", "TDTEACH")]
## DISCLISCI TEACHSUP IBTEACH TDTEACH
## 1: NA NA NA NA
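The NA results above occur because these scale scores contain missing values. Passing na.rm = TRUE through lapply() as an extra argument to mean() would return the column means instead (a sketch, not run here since it depends on the random6 data set):

```r
# extra arguments to lapply() are forwarded to mean()
random6[, lapply(.SD, mean, na.rm = TRUE),
        .SDcols = c("DISCLISCI", "TEACHSUP", "IBTEACH", "TDTEACH")]
```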
# run descriptive statistics for three additional variables hypothesized to relate to PA032Q03TA
descriptive_stats <- random6[ ,lapply(.SD, psych::describe),
.SDcols = c("DISCLISCI", "TEACHSUP","TDTEACH")]
# reshape table into long format so that it is easier to read the output
melt(descriptive_stats)
## Warning in melt.data.table(descriptive_stats): id.vars and measure.vars are
## internally guessed when both are 'NULL'. All non-numeric/integer/logical type
## columns are considered id.vars, which in this case are columns []. Consider
## providing at least one of 'id' or 'measure' vars in future.
## variable value
## 1: DISCLISCI.vars 1.000000e+00
## 2: DISCLISCI.n 3.140600e+04
## 3: DISCLISCI.mean 1.445539e-01
## 4: DISCLISCI.sd 1.002982e+00
## 5: DISCLISCI.median 3.900000e-03
## 6: DISCLISCI.trimmed 1.504836e-01
## 7: DISCLISCI.mad 1.032186e+00
## 8: DISCLISCI.min -2.416200e+00
## 9: DISCLISCI.max 1.883700e+00
## 10: DISCLISCI.range 4.299900e+00
## 11: DISCLISCI.skew -1.627706e-01
## 12: DISCLISCI.kurtosis -2.219641e-01
## 13: DISCLISCI.se 5.659616e-03
## 14: TEACHSUP.vars 1.000000e+00
## 15: TEACHSUP.n 3.062000e+04
## 16: TEACHSUP.mean 1.259763e-01
## 17: TEACHSUP.sd 9.746136e-01
## 18: TEACHSUP.median 4.540000e-02
## 19: TEACHSUP.trimmed 1.846121e-01
## 20: TEACHSUP.mad 9.899320e-01
## 21: TEACHSUP.min -2.719500e+00
## 22: TEACHSUP.max 1.447500e+00
## 23: TEACHSUP.range 4.167000e+00
## 24: TEACHSUP.skew -3.721514e-01
## 25: TEACHSUP.kurtosis -1.845543e-01
## 26: TEACHSUP.se 5.569675e-03
## 27: TDTEACH.vars 1.000000e+00
## 28: TDTEACH.n 3.012000e+04
## 29: TDTEACH.mean -2.597529e-02
## 30: TDTEACH.sd 9.603662e-01
## 31: TDTEACH.median -1.170000e-02
## 32: TDTEACH.trimmed -3.303284e-02
## 33: TDTEACH.mad 8.873361e-01
## 34: TDTEACH.min -2.447600e+00
## 35: TDTEACH.max 2.078100e+00
## 36: TDTEACH.range 4.525700e+00
## 37: TDTEACH.skew -7.223483e-03
## 38: TDTEACH.kurtosis 3.790284e-01
## 39: TDTEACH.se 5.533621e-03
## variable value